The multivariate Pólya distribution, named after George Pólya, also called the Dirichlet compound multinomial distribution, is a compound probability distribution, where a probability vector p is drawn from a Dirichlet distribution with parameter vector , and a set of discrete samples is drawn from the categorical distribution with probability vector p. The compounding corresponds to a Polya urn scheme. In document classification, for example, the distribution is used to represent probabilities over word counts for different document types.
Contents |
We are doing N independent draws from a categorical distribution with K categories. Let x=(n1,n2,...,nK) denote the vector of counts, where nk is the number of times category k was drawn. If the parameter of the categorical distribution is given as p=(p1,p2,...,pK), where is the probability to draw value k, the probability distribution for counts, P(x|p) is given by the associated multinomial distribution with parameter p. But now p is not given, but instead considered drawn from a Dirichlet distribution with parameter vector . The resulting compound distribution is obtained by integrating out p:
which results in the following explicit formula:
where is the gamma function, with
The probability mass function may be written more compactly in terms of the beta function, as follows:
where is the beta function.
The one-dimensional version of the multivariate Pólya distribution is known as the Beta-binomial distribution.
The multivariate Pólya distribution is used in automated document classification and clustering, genetics, economy, combat modeling, and quantitative marketing.